ABSTRACT

Determining the effectiveness of a computer simulation model in
duplicating a desired real world phenomenon is an important unsolved
problem. The purpose of this paper is to model the validation procedure
in a broad context and develop a general methodology for the
statistical part of validation. A procedure calling on utility,
decision, simulation, and statistical theories is developed. The
goals of statistical testing are presented, and the assumptions,
properties, and results of several parametric and nonparametric tests are
discussed and compared.
I. INTRODUCTION

With the advent of complex computer simulations of real world
phenomena, a means of judging the worth or validity of a simulation has
become very important. Yet to associate a strict valid-invalid judgement
with any simulation model is quite misleading. Models can be
considered valid under certain circumstances and invalid under others,
and a model judged valid by one criterion may be judged invalid by
another.

In the past one of the largest problems of model validation has been
the definition of the term. Normally what has been thought of as model
validation is the statistical testing of collected data and the
comparison of the results with a predetermined test criterion. For this reason
validation has come under attack for being a form of statistical
chicanery and merely a means to add credence to an already accepted model.

The  purpose  of  this  paper  is  to  redefine  the  validation  procedure 
in  a  larger  context  and  to  describe  some  of  the  problems,  techniques, 
and  assumptions  associated  with  the  statistical  portion  of  validation. 
It  is  hoped  that  with  this  procedure  model  validation  will  become  a  more 
definitive  process  and  that  the  mystical  air  normally  associated  with 
statistical  testing  procedures  will  be  removed. 


II.  VALIDATION  PHILOSOPHY 

The present procedure of validation is basically as follows. After
the requirement to validate a model has been given, agreement on a
significance level for statistical testing is reached. The data is then
given to a statistician to find an appropriate testing procedure. Using
the predetermined level of significance, it is then determined whether the
difference between the real world and model data is significant. The
decision-making procedure is quite simple: if the results of the test
indicate a significant difference, the model is said to be invalid;
if the difference is not significant, it is considered valid. This
procedure is shown in Figure 1 and is the one used in a recent
validation [12].*

This type of apparently straightforward validation has two basic
problems. The first is that the decision rule, while seemingly well
defined, is actually more complex and involves such things as cost and
utility models as well as statistical theory. As an example, in a
validation done by the Systems Analysis Group [23] the level of
significance was set at .5 in order to make the probability of accepting
an invalid model small. This decision must have involved consideration of
the costs of accepting an invalid model and of rejecting a valid model.
After compiling these costs, the principles of utility theory must have
been used in arriving at a figure of .5 as the best level of significance.
But none of these considerations were mentioned in the report of the
validation. So in the past, and even on present validation projects, the
decision rules are not fully explained, but should be if meaningful
validation is to be achieved. The second problem is that much confusion
exists in the method of selection of a particular statistical test. In
very few cases, if ever, will a test be perfect for the data. An
assumption will therefore be relaxed slightly to make use of a strong property
of a test, yet if another test were used that assumption might not have
to be violated. The question of which assumptions and properties of a
test are most important is very complex and the answers are not clearly
defined. Thus the properties and goals of statistical tests must be
more completely defined. It must also be realized that while the passing
or failing of a single test, or at most of a few tests, constitutes a
decision rule now, the results of these tests should correspond only to
a single element of an n dimensional decision vector.

* Number in brackets is reference number as listed on pages 50 and 51

FIGURE 1
A PRESENTLY USED MODEL VALIDATION SCHEME

The  philosophy  of  the  present  validation  procedure  is  sound.  What  is 
proposed  is  a  new  procedure,  rather  than  philosophy,  directed  at  allowing 
the  decision  maker  more  flexibility.  This  involves  describing  decisions 
in  terms  of  utility,  simulation,  and  decision  theory  as  well  as  just 
collected  data  and  statistics. 

The procedure can be thought of as the interchange of information
among three modules: Simulation Theory, Statistical Theory, and
Decision-Utility Theory. Within each module there are several nodes,
such as the testing node in the statistical module. Several nodes, such
as the criterion node, are shared between modules. In general the
procedure would work as follows. Information concerning the model and
real world, such as data, flows from the state of nature node through the
validation information node and into the decision node. From the decision
node several paths exist. The validation may be terminated by accepting or
rejecting the model, the model may be temporarily rejected while further
data is gathered or additional comparisons conducted, or the decision may
be to compare the information by means of a statistical test. Regardless of
the choice, if the validation is not terminated, more information enters
the validation information node as a result of the decision and the
process will continue in a cycling manner until the decision to terminate
the validation is given. Figure 2 illustrates the concept of modules and
nodes interacting to form a validation procedure.
FIGURE 2
PROPOSED VALIDATION SCHEME

III. VALIDATION DECISIONS

There are four possible outcomes of any decision rule. These are
based on the two decisions:

$D_0$ = Decision the model is valid

$D_1$ = Decision the model is invalid

and the two possible underlying states of nature:

$S_0$ = The model is valid

$S_1$ = The model is invalid.

One possible outcome is $D_0 S_1$, or the incorrect decision that the model is
valid when it is invalid. The other outcomes are $D_0 S_0$, $D_1 S_0$, and $D_1 S_1$.
Why the decision maker makes a particular decision depends upon the
decision rule he is using. As an example of a simple decision rule
consider a validation procedure which is similar to the one presently being
used. It consists of one model, the two states of nature $S_0$ and $S_1$, and
the two decisions $D_0$ and $D_1$. The statistical test is exact, that is:

$$\Pr(X^* = x_0 \mid S_0) = 1$$

$$\Pr(X^* = x_1 \mid S_1) = 1$$

or when the state of nature, $S$, is $S_0$ then the test statistic $X^*$ is $x_0$,
and similarly when $S = S_1$, then $X^* = x_1$. The simple decision rule is:
if $X^* = x_0$ then $D = D_0$, and if $X^* = x_1$ then $D = D_1$. Again note that this
is basically what is done in present validations. A test is performed
and according to the results of the test alone the model is said to be
valid or invalid.

Expanding the above, consider the following procedure consisting of
the same model and states of nature. The test is no longer exact. Now,
when $S = S_0$, $X^*$ is a random variable with density function $p_0(x)$, and when
$S = S_1$, $X^*$ is a random variable with density function $p_1(x)$. $X^*$
represents the test statistic whose range is the real line $X$. Since $X^*$
is now a random variable the decision rule may become more complex but
the possible results are still the same.

FIGURE 3
RESULTS OF DECISION RULES AND THEIR PROBABILITIES

The probabilities of making the decisions are:

$\alpha$ = Probability of rejecting a valid model
$\beta$ = Probability of accepting an invalid model
$1-\alpha$ = Probability of accepting a valid model
$1-\beta$ = Probability of rejecting an invalid model.
As an example of another decision rule consider a case where

$p_0(x)$ is normal $(0, \sigma^2)$
$p_1(x)$ is normal $(\mu, \sigma^2)$, $\mu > 0$.

Let $X_0$ be that portion of the real line, $X$, such that all points in $X_0$
are less than or equal to $X^{+}$, while the other points constitute $X_1$, thus

$$X_0 = \{x : x \le X^{+}\}$$

$$X_1 = \{x : x > X^{+}\}$$

$X^{+}$ is an arbitrary point and can be determined either before or after
data is collected depending upon the decision rule. If $X^{+}$ is
determined before the test is performed then the decision rule might be
that if $X^*$ is in $X_1$ then $D_1$, otherwise $D_0$. With this decision rule the
corresponding decision probabilities are:

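$$\Pr(D_0 \mid S_1) = \beta = \int_{X_0} p_1(x)\,dx = \text{probability of accepting an invalid model}$$

$$\Pr(D_1 \mid S_0) = \alpha = \int_{X_1} p_0(x)\,dx = \text{probability of rejecting a valid model}$$

$$\Pr(D_0 \mid S_0) = 1-\alpha = \int_{X_0} p_0(x)\,dx = \text{probability of accepting a valid model}$$

$$\Pr(D_1 \mid S_1) = 1-\beta = \int_{X_1} p_1(x)\,dx = \text{probability of rejecting an invalid model}$$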
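
As a rough numerical sketch of these integrals, the following Python fragment evaluates the two error probabilities for the normal case described above. The mean of 2, standard deviation of 1, and cut-off of 1 are illustrative assumptions and are not taken from any validation; the function name is hypothetical.

```python
from statistics import NormalDist


def decision_probabilities(mu, sigma, cutoff):
    """Return (alpha, beta) for the rule: decide D1 (invalid) when X* > cutoff."""
    p0 = NormalDist(0.0, sigma)   # distribution of X* when the model is valid (S0)
    p1 = NormalDist(mu, sigma)    # distribution of X* when the model is invalid (S1)
    alpha = 1.0 - p0.cdf(cutoff)  # Pr(D1 | S0): integral of p0 over X1
    beta = p1.cdf(cutoff)         # Pr(D0 | S1): integral of p1 over X0
    return alpha, beta


# Illustrative (assumed) values only.
alpha, beta = decision_probabilities(mu=2.0, sigma=1.0, cutoff=1.0)
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```

Moving the cut-off to the right lowers the probability of rejecting a valid model and raises the probability of accepting an invalid one, which is the trade-off the cost considerations of Section II are meant to settle.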

Figure 4 shows the functions graphically with $X^{+}$ and the
probabilities $\alpha$ and $\beta$.

FIGURE 4

GRAPHIC DISPLAY OF POSSIBLE DISTRIBUTIONS OF TEST STATISTIC
ASSUMING $p_0(x) = n(0, \sigma^2)$ AND $p_1(x) = n(\mu, \sigma^2)$

It should be noted that Figure 4 represents a much simplified pair
of density functions. The nature of validation and the associated test
statistics often prevents any knowledge of the exact distribution
$p_1(x)$. In only one of the tests performed in this paper can $\beta$ be
readily determined. Another complicating feature of validation is that
$p_1(x)$ usually flanks $p_0(x)$ or even overlaps $p_0(x)$ over its entire
domain. Figure 5 shows these possibilities.

FIGURE 5
POSSIBLE INTERACTIONS BETWEEN $p_0(x)$ AND $p_1(x)$

In both parts a) and b) of Figure 5 the previous decision rule could
have been used, but by associating costs with the various results of a
decision rule, a payoff matrix can be formed and another type of
decision rule employed.

Utility theory and cost structures will determine the values of the
decision matrix, but in general the $S_1 D_0$ element will represent the
highest cost, for it represents the acceptance of an invalid model, and
thus all subsequent decisions based on the assumption that the model is
valid will also be in error. $S_0 D_0$ will usually cost the least, but the
ordering of $S_1 D_1$ and $S_0 D_1$ will vary depending on the costs of additional
experimentation and realignment of the model.
With the costs of each decision result determined, and the decision
maker willing to accept a priori knowledge of $\Pr(S = S_0)$ and $\Pr(S = S_1)$,
he can, by using a rule such as minimax, choose a decision which will
minimize cost. If through information received from the statistical
module he is willing to accept values of $\alpha$ and $\beta$, then the costs of the
decisions will become expected costs, and by using the same minimax
decision rule the minimum expected cost can be found. Thus with this
decision rule, the relationship between the statistical and decision modules
can be seen as one in which additional information received from the
data is transmitted to the decision maker, allowing him access to more
information about the model and helping him to refine his decision.
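
The two uses of the cost matrix described above might be sketched as follows. Every number here (the costs, the prior probability, and the two error probabilities) is invented for illustration, and the function names are hypothetical. The first function applies minimax to the raw costs; the second computes the expected cost of letting the statistical test decide once the two error probabilities are accepted.

```python
# Hypothetical costs, indexed as COST[state][decision]; illustrative only.
COST = {
    "S0": {"D0": 0.0, "D1": 4.0},    # model valid: accept / reject
    "S1": {"D0": 10.0, "D1": 2.0},   # model invalid: accept / reject
}


def minimax_decision(cost):
    """Pick the decision whose worst-case cost over the states is smallest."""
    return min(("D0", "D1"), key=lambda d: max(cost[s][d] for s in cost))


def expected_cost_of_test(cost, prior_s0, alpha, beta):
    """Expected cost of following the test, given Pr(S = S0), alpha and beta."""
    pr_decision = {
        "S0": {"D0": 1.0 - alpha, "D1": alpha},   # valid model
        "S1": {"D0": beta, "D1": 1.0 - beta},     # invalid model
    }
    prior = {"S0": prior_s0, "S1": 1.0 - prior_s0}
    return sum(prior[s] * pr_decision[s][d] * cost[s][d]
               for s in cost for d in ("D0", "D1"))


print(minimax_decision(COST))
print(expected_cost_of_test(COST, prior_s0=0.5, alpha=0.1, beta=0.2))
```

With these assumed costs the worst case of accepting (10) exceeds the worst case of rejecting (4), so minimax on the raw costs rejects the model; the expected-cost figure shows how information from the statistical module changes the comparison.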

The  choice  of  which  decision  rule  to  use  is  a  subject  in  itself  and 
is  left  for  future  study.  Regardless  of  the  method  chosen  though,  the 
value  of  statistical  information  from  the  data  is  apparent. 

How  to  realign  the  simulation  model,  when  the  decision  to  reject  it 
is  made,  is  the  subject  of  simulation  theory.  This  also  is  a  complex 
field  and  left  for  future  study. 
IV. STATISTICAL THEORY MODULE AND TESTS

Two measures of the probability of rejecting a valid model are $\alpha$ and
P, where P is determined by finding the largest value of $\alpha$ for which the
null hypothesis that the model and real world are sampling from the same
distribution can be accepted given the test statistic. Thus P
represents the value of $\alpha$ at which the decision concerning the null
hypothesis passes from acceptance to rejection and is determined after the
test statistic is computed, whereas $\alpha$ is arbitrarily predetermined and
used to compare with the results of the test. When comparing different
tests, two approaches could be used. Either an $\alpha$ level of significance
could be predetermined and each test receive a pass or fail rating, or
a P value could be determined. For more sensitive comparisons P values
will be determined when applying data to tests in the following
sections.
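
The distinction between the two approaches can be illustrated with a short sketch. It assumes, purely for illustration, a test statistic that is unit normal under the null hypothesis and a rule that rejects on large values; neither assumption comes from the tests discussed later, and the function names are hypothetical.

```python
from statistics import NormalDist

NULL = NormalDist(0.0, 1.0)  # assumed null distribution of the test statistic


def p_value_upper(x_star):
    """P = Pr(X >= x_star) under the null, for a rule rejecting on large values."""
    return 1.0 - NULL.cdf(x_star)


def passes(x_star, alpha):
    """Pass/fail rating at a predetermined level: fail when P is less than alpha."""
    return p_value_upper(x_star) >= alpha


x_star = 1.0                      # illustrative observed statistic
print(p_value_upper(x_star))      # the P value itself
print(passes(x_star, 0.109))      # rating at one predetermined level
print(passes(x_star, 0.5))        # rating at another
```

The same statistic passes at one level and fails at the other, which is why reporting P carries more information than a single pass or fail rating.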

The statistical module of model validation operates as follows.
Real world data is compared with model simulation data by one of many
statistical tests. For each test $p_0(x)$ is known and the value of
the test statistic computed. Given $p_0(x)$ and the test statistic, the P
value is determined, along with $\beta$ if $p_1(x)$ is known. This information
is then passed into the test results node for further transmission into
the validation information node. If $\alpha$ was predetermined and the test
statistic $X^*$ fell into the critical region, i.e., P was less than $\alpha$,
then based on that particular test the decision that the model be
accepted cannot be endorsed. The decision to reject or accept a model
could be thought of as an n dimensional vector of which the test result
is merely one component.

The types of statistical test previously used for model validation,
and most likely to continue being used, are both parametric and
nonparametric. Before describing several of these tests, a description of
their inherent differences should be useful.

A.   PARAMETRIC  AND  NONPARAMETRIC  TESTS 

A parametric statistical test is a test in which specific
assumptions, such as $\mu = 0$ or $\mu_1 = \mu_2$, about the parameters of the sampled
population are made, whereas a nonparametric test, as the name implies,
makes no assumptions about the value of the parameters in the sampled
population but rather assumes only that a distribution exists. Another
term often used interchangeably with nonparametric is distribution free.
A distribution free test differs from both the parametric and
nonparametric tests in that it makes no assumptions about the form of the
sampled distribution. In this paper the terms nonparametric and
distribution free will be used interchangeably. More important than the
difference in the definitions is the difference in the assumptions which
must be made when testing parametrically versus nonparametrically. To
determine critical values both tests require that the distribution of
the test statistic be fully known. In the case of the parametric tests
this often requires that the sample size be large so that the asymptotic
distribution of the test statistic is known. The distribution, $p_0(x)$,
of the test statistic in the nonparametric case is generally known
precisely and need not be assumed. Other assumptions of the parametric
tests may include independence of observations, underlying normal
distribution of the sampled populations, homoscedasticity or at least a
known ratio of variances among populations in the case of a multiple
sample test, and that the data is measured in at least an interval
scale, meaning that operations with the data are isomorphic to arithmetic.

The assumptions associated with nonparametric tests include only that
sampled populations be continuous and, in some cases, be symmetric or
identical. As with parametric tests, the observations are assumed to
be independent. For a more complete discussion of the assumptions see
[2,21].

The more practical advantages of the nonparametric tests include
their intuitive attraction, simplicity of derivation, and ability to
be understood conceptually. They are often easier to apply, but
this quality deteriorates rapidly as the sample size increases past 30.
Perhaps the largest advantage, however, is their statistical efficiency.*
As Bradley explains [3]:

When judged by the mathematical criterion of statistical
efficiency, distribution-free tests are often superior to their
most efficient parametric counterparts when both tests are
applied under "nonparametric" conditions, i.e., conditions
meeting all assumptions of the distribution-free test, but
failing to meet some of the assumptions of the parametric
test. When both tests are applied under "parametric"
conditions, i.e., conditions meeting all assumptions of the
parametric test, and therefore of both tests, distribution-free
tests are usually very slightly less efficient at small sample
sizes, becoming increasingly less efficient as sample size
increases.

Thus with large samples the parametric tests are more powerful provided
that their assumptions are met. This margin of power enjoyed by the
parametric tests shrinks as the sample size decreases, until the sample
size becomes small enough, 6-10, that the power differential is
insignificant. On the other hand, when the parametric assumptions are
falsely made, but the nonparametric assumptions are not, the
nonparametric tests are often superior.

* Power or statistical efficiency is defined as the ratio of the
parametric test sample size to the nonparametric test sample size required
to make the power of the two tests equivalent. If the power efficiency
of a nonparametric test is 96%, then if the more powerful parametric
test has 10 samples the nonparametric test needs only 10/.96 = 10.4
samples to be of equal power.

Since one of the underlying assumptions in validation is that the
sample size of real world data will be quite small, it is necessary to
consider the effect of the parametric and nonparametric assumptions in
terms of small samples, i.e., less than 10. Again according to Bradley,
it is with small samples that violations of the parametric assumptions
have their most drastic effect and, in addition, are least likely to be
detected. If a parametric test can be used, even though
it is more powerful, its advantage over the nonparametric test is slight
due to the small sample size.

Because of the many facets of both types of tests it would be foolish
to say that only tests of a single type should be used. Equally
foolish would be an attempt to categorize the types of data to be
validated with specific statistical tests. This choice remains in the
decision  node  of  the  validation  procedure.  So  rather  than  attempt  such 
a  recipe  it  is  beneficial  to  look  at  what  has  been  done  in  several 
validations  and  what  the  differences  in  critical  regions  or  P  values  are 
when  tests  requiring  various  assumptions  are  performed  on  the  same  data. 
In  order  to  examine  these  differences,  a  sensitivity  analysis  on  two 
sets  of  data  with  respect  to  statistical  tests  and  their  inherent 
assumptions  was  performed. 
V. NONPARAMETRIC TESTS WITH SUBMARINE DATA BASE

The data used in the nonparametric tests of this section was obtained
from a sequence of submarine exercises in which a submarine attempted to
detect another submarine transiting through a defined region. If
detection was made then both the range and aspect of the detected
submarine were recorded. A stern aspect indicates a retreating contact
and the corresponding range would be negative. A positive detection
range indicates a bow aspect at initial detection of an incoming
submarine. Thus for a given exercise the data might be: out of 10
possible detections initial detection was made at 8, 3, -6, and 1 miles.
This should be interpreted as follows: the exercise was run 10 times
and detection occurred on 4 runs. On three of these runs the initial
detection was made when the transiting submarine was at a range of 8, 3,
and 1 miles and closing, while on the remaining run detection was made at 6
miles but the range was opening. These exercises were simulated on
the computer and similar results tabulated. Summarizing, the following
constitutes the data base for this validation. The submarine model was
tested under 10 different sets of conditions, such as speed and depth.
Calling each set of conditions an input, there are 10 distribution
functions, each of which corresponds to an input. For each of the inputs
there are several samples from the real world exercises and many, 100-120,
samples from the computer simulation model. Two measures of
effectiveness have been observed, namely the frequency of detection and
range of initial detection.

Once again, the goal of the statistical module is to determine
$p_0(x)$ and $p_1(x)$, the size of the critical region or P value, and $\beta$.
Here $p_0(x)$ is the distribution of the test statistic under the null
hypothesis that the real world and model are sampling from the same
distribution, and $p_1(x)$ is the distribution of the test statistic when
the two distributions are not the same.

A.  TESTS 

The  tests  chosen  to  illustrate  what  might  be  done  in  validating  the 
submarine  model  include  the  nonparametric  Kolmogorov-Smirnov  Test,  the 
Fisher  Exact  Test,  the  Wilcoxon  Test,  and  the  tests  used  in  the  initial 
validation  of  this  model  [24]. 

B.  KOLMOGOROV-SMIRNOV  TEST 

Perhaps the most heuristic of the statistical tests is the
Kolmogorov-Smirnov Two Sample Test, often referred to as the Smirnov
Maximum Deviation Test. The test statistic is the maximum deviation
between the two empirical cumulative distribution functions.

To compute the test statistic rank the n real world and m model
observations and give each a subscript corresponding to its rank. For
each possible rank i, i = 1,...,n+m, calculate $d_i$, where:

$$d_i = \frac{r_i}{n} - \frac{s_i}{m}$$

and

$r_i$ = the number of real world observations less than the ith
order statistic

$s_i$ = the number of simulated observations less than the ith
order statistic.

The test statistic, D, is $\max |d_i|$, i = 1,...,n+m. Under the hypothesis
that the observations came from the same distribution, the distribution
of D is known and can be calculated for any combination of n and m [4,20].
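
The computation just described can be sketched in a few lines of Python. This is a minimal illustration, not the program used for the results that follow: the function name is hypothetical, counts are taken as "less than or equal to" each pooled observation (an assumption, matching the usual empirical distribution function), and the sample ranges are the illustrative ones used in the Wilcoxon example of Section V.C rather than exercise data.

```python
def ks_two_sample_statistic(real, model):
    """Maximum absolute deviation between the two empirical distribution functions."""
    n, m = len(real), len(model)
    d_max = 0.0
    for x in sorted(list(real) + list(model)):   # step through the pooled order statistics
        r = sum(1 for v in real if v <= x)       # real world observations at or below x
        s = sum(1 for v in model if v <= x)      # simulated observations at or below x
        d_max = max(d_max, abs(r / n - s / m))
    return d_max


real_world = [1, 5, 8]               # illustrative detection ranges (miles)
simulated = [6, 10, 12, 14, 20]
print(ks_two_sample_statistic(real_world, simulated))
```

For these illustrative ranges the largest deviation occurs at 8 miles, where all three real world observations but only one of the five simulated observations have been passed.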
As an illustration of this test consider the 10th input where n+m
is 106, and n=3. The ranges of real world detections were ranked among
the model detections and the values of i are i = 7, 26, 38. The two step
functions are shown in Table I.

TABLE I

CUMULATIVE STEP FUNCTIONS IN KOLMOGOROV-SMIRNOV
TEST WITH INPUT 10 OF SUBMARINE DATA

[Plot of the two cumulative step functions against rank i; horizontal axis marked at 7 and at 10 through 100 in steps of 10]

D occurs at i = 38 where

$$d_{38} = 3/3 - 35/106 \approx .67.$$

Using the approximation to $p_0(x)$, the P value is found to be .2024.
Table II gives a summary of the P values when the test was applied in a
similar fashion with the remaining 9 inputs.

This test has all the previously mentioned advantages of
nonparametric statistics, especially intuitive appeal. It also has the
advantage of testing for differences in the distributions caused by all
the properties of the distribution function instead of just the
differences in mean or variance.

A major restriction is placed on the validity of the results by
using the approximation to $p_0(x)$. Hodges [15] has shown that as m and n
increase the approximation may differ significantly from $p_0(x)$; thus not
only does power efficiency decrease with increasing sample size, but the
approximation of $p_0(x)$ also becomes less valid. The effect on P of
approximating the distribution function can be shown by comparing the
results of this approximation with those of an exact test. This is
shown in Table XVII, page 45.

TABLE II

SUMMARY OF RESULTS FOR KOLMOGOROV-SMIRNOV
TEST WITH SUBMARINE MODEL DATA

INPUT    P VALUE

  1       .485
  2       .260
  3       .980
  4       .998
  5       .941
  6       .491
  7       .922
  8       .577
  9       .792
 10       .202

C.  WILCOXON  TEST  IN  THE  ORIGINAL  VALIDATION 

In the original validation of the submarine encounter model and the
associated data [26], the Wilcoxon Test was used with the initial range
of detection data. In this test the observations from both sources for
a given input are aggregated and ranked in order of magnitude. If both
model and real world are sampling from the same distribution then all
combinations of ranks are equally likely. The smallest and
largest possible rank sums make up the critical region since they
represent the least likely results. For example, suppose the real
world detections occurred at 1, 5, 8 miles out of 6 runs and out of
20 runs the simulation results had initial detection at 6, 10, 12, 14,
20 miles. The summed ranks for the real world would be 1+2+4=7 and
at an alpha level of .109 the difference between real world and model
results is significant.
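
The example can be checked by brute force, since with 3 real world and 5 simulated detections there are only 56 equally likely sets of ranks. The sketch below is an illustration under stated assumptions: the function names are hypothetical, ties are assumed absent, and the two-sided P value is taken as the probability of a rank sum at least as far from its null mean as the one observed, which is one common convention and not necessarily the one behind Table III.

```python
from itertools import combinations


def rank_sum(real, model):
    """Sum of the ranks of the real world observations in the pooled sample (no ties)."""
    pooled = sorted(list(real) + list(model))
    return sum(pooled.index(v) + 1 for v in real)


def two_sided_p(w, n, total):
    """Exact Pr(rank sum at least as extreme as w) when all rank sets are equally likely."""
    sums = [sum(c) for c in combinations(range(1, total + 1), n)]
    centre = n * (total + 1) / 2                     # null mean of the rank sum
    extreme = sum(1 for s in sums if abs(s - centre) >= abs(w - centre))
    return extreme / len(sums)


w = rank_sum([1, 5, 8], [6, 10, 12, 14, 20])
print(w)                                  # rank sum of the real world detections
print(two_sided_p(w, n=3, total=8))       # exact two-sided P value
```

Only 4 of the 56 rank sets (sums of 6, 7, 20, and 21) are at least as extreme as the observed sum of 7, so the P value is 4/56, or about .071, which is below .109 and agrees with the conclusion in the text.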

Table  III  lists  the  acceptance  regions  in  ranked  sums  for  various 
model  and  real  world  outputs.  In  each  case  the  level  of  significance 
is  .109  and  the  real  world  ranks  are  to  be  summed.  Exact  P  values  were 
not  found  in  the  original  validation. 

There are two difficulties with the original use of the test,
however. In an apparent attempt to avoid the tedious counting
procedures outlined in the next section, the model data was divided into
sets of 20, each then tested with the set of real world data. Each test was
considered independently, thus the level of significance is considered
to be $(1-.109)^6$ or .5. This is derived by the following argument.
Since $\alpha$ was predetermined as $\alpha > .5$, then $(1-P)^n < .5$, where P is the
probability of failing a single test if the state of nature is $S_0$, and n is
the number of tests. If n = 6 then P becomes .109. It seems very
difficult to believe, however, that these tests are independent if the same
real world observations are to be used in each test. The second
difficulty is the method in which the rank sums were determined. Instead
of considering an initial detection of 8 miles differently than one of
-8 miles, both were given the same rank.
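
The arithmetic behind the figures .109 and .5 can be reproduced as follows. The sketch assumes the six tests are independent, which is exactly the assumption questioned above, and the function names are hypothetical.

```python
def prob_all_pass(per_test_level, n_tests):
    """Probability that a valid model passes every one of n independent tests."""
    return (1.0 - per_test_level) ** n_tests


def per_test_level_for(target_all_pass, n_tests):
    """Per-test level at which n independent tests are all passed with the target probability."""
    return 1.0 - target_all_pass ** (1.0 / n_tests)


print(prob_all_pass(0.109, 6))        # close to .5
print(per_test_level_for(0.5, 6))     # close to .109
```

Because the same real world observations enter every test, the six results are dependent and the true overall level need not equal the value computed this way.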
TABLE III

RANK-SUM ACCEPTANCE REGIONS WITH $\alpha$ = .109*

                         R2
 R1       2        3        4        5        6

  2       -       7-12    11-18
  3      3-8      7-14    12-21
  4      3-10     8-17    12-23
  5      4-12     8-18    13-26
  6      4-13     9-21    14-29
  7      4-15    10-24    16-33    23-42
  8      5-17    10-26    16-35    24-46
  9      5-19    11-28    18-38    25-49
 10      5-20    12-31    19-41    27-53
 11      6-22    12-32    20-44    29-57    38-70
 12      6-24    13-35    21-47    30-60    40-74
 13      6-25    13-37    22-50    32-64    42-78
 14      7-27    15-40    23-53    33-67    43-82
 15      7-29    15-42    24-56    34-70    45-86
 16      8-31    16-45    25-59    36-73    47-90
 17      8-32    16-46    26-62    37-78    49-95
 18      8-34    17-49    28-65    39-82    51-99
 19      9-36      -      28-67    40-85    53-103
 20       -        -      30-71    41-88    55-107

R1 = No. of detections by model in sample size 20.
R2 = No. of detections in real world exercise, with 6 runs

* By permission from Submarine ASW Encounter Simulation Model Detection
Validation (u) by Systems Analysis Office, ASW Systems Project Office
(1967).

Both these difficulties are corrected and a more exact test made in
the next section. Table IV presents a summary of results using the
original testing procedure.

TABLE IV

SUMMARY OF THE ORIGINAL WILCOXON TEST
RESULTS USING THE SUBMARINE DATA

RUN NUMBER    P VALUE

     1        greater than .5
     2        greater than .5
     3        greater than .5
     4        greater than .5
     5        greater than .5
     6        greater than .5
     7        greater than .5
     8        less than .5
     9        greater than .5
    10        less than .5

D.   EXACT  WILCOXON  RANK  SUM  TEST 

An improvement over the original use of the Wilcoxon Rank-Sum Test in
validating this model can be made by not dividing the model data into
subsets but rather considering it as a single sample of size 120, and by
determining the exact distribution of the Wilcoxon test statistic. In
order to determine the distribution p_0(x) it is necessary to compute such
things as the number of possible ways 4 numbers can be sampled without
replacement from the positive integers 1 through 124 such that their sum
is less than or equal to 165. A recursive counting procedure for large
numbers like 165 has been developed by Fix and Hodges [10] and was used in
determining P values for 4 of the 10 inputs. As an approximation to the
test, note that the ranks form a finite population, so the expected value
and variance of the average rank can be determined exactly. The average of
the observed ranks, minus its expected value and divided by its standard
deviation, is approximately distributed as the unit normal. Kruskal and
Wallis [16] have suggested an additional correction for continuity when
using this approximation.
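
The counting problem described above can be illustrated with a short sketch. This is not the Fix and Hodges procedure [10]; it is a generic dynamic-programming count, shown alongside the normal approximation without continuity correction. All function names are illustrative assumptions.

```python
from math import comb, erfc, sqrt


def count_subsets_with_sum_at_most(n_total, k, bound):
    """Number of k-element subsets of {1, ..., n_total} with sum <= bound."""
    # ways[j][s] = number of j-element subsets (of the values seen so far)
    # whose sum is exactly s
    ways = [[0] * (bound + 1) for _ in range(k + 1)]
    ways[0][0] = 1
    for value in range(1, n_total + 1):
        # Update larger subset sizes first so each value is used at most once.
        for j in range(min(k, value), 0, -1):
            row, prev = ways[j], ways[j - 1]
            for s in range(bound, value - 1, -1):
                row[s] += prev[s - value]
    return sum(ways[k])


def exact_lower_tail(n_total, k, bound):
    """Exact probability that the rank sum of k ranks is <= bound."""
    return count_subsets_with_sum_at_most(n_total, k, bound) / comb(n_total, k)


def normal_lower_tail(n_total, k, bound):
    """Normal approximation to the same probability (no continuity correction)."""
    mean = k * (n_total + 1) / 2
    var = k * (n_total - k) * (n_total + 1) / 12
    z = (bound - mean) / sqrt(var)
    return 0.5 * erfc(-z / sqrt(2))


if __name__ == "__main__":
    # 4 real world ranks among 124 pooled observations, rank sum at most 165
    print(exact_lower_tail(124, 4, 165))
    print(normal_lower_tail(124, 4, 165))
```

The finite-population mean and variance used here are the standard ones for a rank sum of k ranks drawn without replacement from 1 through N.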

As an example of the effect of the normal approximation, observe the
summary of P values using the Wilcoxon Rank-Sum Test in Table V.

These are the first results based on the exact distribution p_0(x), but as
shown by the results in Table V the approximations are quite close. Again
the inherent advantages of nonparametric statistics are present, but so is
the lack of knowledge of p_1(x). For a more complete discussion of the
efficiency of this test see [5].

E.  BINOMIAL  TEST  IN  THE  ORIGINAL  VALIDATION 

The other measure of effectiveness used in the statistical portion of the
validation of this model is the probability of detection, P_d. In the
original validation [25] a very inexact test was used. If m model runs and
n real world runs were to be compared, then the probabilities of each
possible outcome were estimated. These probabilities are shown in Table
VII for n equal to 4 and m equal to 20.

To see the inexactness of this test consider how the probabilities are
determined. The probability of x detections in n runs is

\[ \binom{n}{x} P_d^{x} (1 - P_d)^{n-x} \]
TABLE V

SUMMARY OF SUBMARINE MODEL P VALUES WITH
DETECTION RANGE DATA USING THE WILCOXON RANK-SUM TEST

INPUT        EXACT        APPROXIMATION

  1                          .3140
  2          .0066           .009322
  3                          .7900
  4                          .7860
  5          .8735           .8728
  6          .2351           .2448
  7                          .5666
  8                          .3600
  9                          .2684
 10          .0952           .0902

TABLE VI

SUMMARY OF P VALUES USING NONPARAMETRIC
TESTS ON SUBMARINE MODEL RANGE OF DETECTION DATA

          K-S TEST     WILCOXON RANK SUM TEST     ORIGINAL RANK
INPUT     APPROX.      EXACT        APPROX.       SUM TEST APPROX.

  1        .485                     .3140             > .5
  2        .260        .0066        .009322           > .5
  3        .980                     .7900             > .5
  4        .998                     .7860             > .5
  5        .941        .8735        .8728             > .5
  6        .491        .2351        .2448             > .5
  7        .922                     .566              > .5
  8        .577                     .3600             < .5
  9        .792                     .2684             > .5
 10        .202        .0952        .0902             < .5
If the null hypothesis is true and the model and real world runs are
independent, then the probability of observing y detections out of m model
runs and x detections out of n real world runs is:

\[ Pr(x,y) = \binom{m}{y} P_d^{y} (1 - P_d)^{m-y} \binom{n}{x} P_d^{x} (1 - P_d)^{n-x} \]

Pr(x,y) is a concave function of P_d, and by taking derivatives with
respect to P_d it can be shown that its maximum occurs at
P_d = (x+y)/(m+n), so that if 0 < x+y < m+n then:

\[ Pr(x,y) \le \binom{m}{y} \binom{n}{x} \left( \frac{x+y}{m+n} \right)^{x+y} \left( 1 - \frac{x+y}{m+n} \right)^{(m+n)-(x+y)} = P(x,y) \]

Thus P(x,y) is an upper bound on the probability that x and y detections
will occur. The P(x,y) are the values listed in Table VII.
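
As a check on the bound, the following sketch evaluates P(x,y) for n = 4 and m = 20 by substituting the pooled estimate (x+y)/(m+n) for P_d. The function names are illustrative assumptions; the formula is the one stated in the text and in the note to Table VII.

```python
from math import comb


def joint_probability(x, y, n, m, p_d):
    """Probability of x detections in n real world runs and y in m model runs,
    assuming independent runs with a common detection probability p_d."""
    real_world = comb(n, x) * p_d ** x * (1 - p_d) ** (n - x)
    model = comb(m, y) * p_d ** y * (1 - p_d) ** (m - y)
    return real_world * model


def upper_bound(x, y, n, m):
    """P(x,y): the joint probability maximized over p_d, attained at the
    pooled proportion (x + y) / (m + n). Requires 0 < x + y < m + n."""
    return joint_probability(x, y, n, m, (x + y) / (m + n))


if __name__ == "__main__":
    for x, y in [(0, 1), (1, 0), (2, 10)]:
        print(x, y, round(upper_bound(x, y, 4, 20), 6))
```

Because the bound is attained only at one value of P_d, every tabulated P(x,y) is at least as large as the true probability of that outcome, which is the source of the inexactness discussed in the text.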

The data have again been divided into groups of 20, and thus the critical
region is reduced to .109 for each test. For the particular values of n
and m the critical region has been partitioned off in Table VII. Should a
pair (x,y) fall into this region for any of the six tests, the hypothesis
is rejected at the .5 significance level.

The primary objection to the test, besides the division of model
observations into groups of 20, is the fact that each P(x,y) is equal to
or larger than its exact value, yet the size of the critical region is
still assumed to be .109. This would seem to indicate that when the null
hypothesis is accepted using this test, it might be rejected when using a
more exact test. This is in fact the case, as shown in Table XI, page 37.

F.  FISHER  EXACT  TEST 

When using the number of detections divided by the number of runs to test
model validity, tests having more exact knowledge of p_0(x) are also
available. One such test is the Fisher Exact Test, based on the
hypergeometric distribution. The P value is determined by computing
TABLE VII
P(x,y) VALUES FOR n = 4 AND m = 20*

                            x
  y      0         1         2         3         4

  0      -      .062623 | .006144   .000474   .000021
  1   .313114   .081919   .014194 | .001611   .000093
  2   .194556   .089891   .022945 | .003524   .000262
  3   .134835   .091778   .031708 | .006277   .000583
  4   .097514   .089837   .040012 | .009900   .001125
  5   .071870   .085359   .047517   .014391 | .001973
  6   .053350   .079194   .053965   .019721 | .003230
  7   .039598   .071953   .059163   .025835 | .005023
  8   .029231   .064093   .062972   .032647 | .007509
  9   .021365   .055975   .065294   .040045   .010883
 10   .015393   .047883   .066074   .047883   .015393
 11   .010883   .040045   .065294   .055975   .021365
 12   .007509 | .032647   .062972   .064093   .029231
 13   .005023 | .025835   .059163   .071953   .039598
 14   .003230 | .019721   .053965   .079194   .053350
 15   .001973 | .014391   .047517   .085359   .071870
 16   .001125   .009900 | .040012   .089837   .097514
 17   .000583   .006277 | .031708   .091778   .134835
 18   .000262   .003524 | .022945   .089891   .194556
 19   .000093   .001611 | .014194   .081919   .313114
 20   .000021   .000474   .006144 | .062623      -

NOTE:  1.  Table entries represent the equation:

\[ P(x,y) = \binom{20}{y} \binom{4}{x} \left( \frac{x+y}{24} \right)^{x+y} \left( 1 - \frac{x+y}{24} \right)^{24-(x+y)} \]

       2.  P(x,y) values lying between the boundary lines (|) define the
           acceptance region.

* By permission from Submarine ASW Encounter Simulation Model Detection
Validation (U) by Systems Analysis Office, ASW Systems Project Office
(1967).
TABLE VIII

SUMMARY OF THE ORIGINAL BINOMIAL
TEST USING THE SUBMARINE Pd DATA

RUN NUMBER       P VALUE

     1           greater than .5
     2           greater than .5
     3           greater than .5
     4           greater than .5
     5           greater than .5
     6           greater than .5
     7           greater than .5
     8           less than .5
     9           greater than .5
    10           less than .5

the probability of obtaining the observed combination of model and real
world detections as well as any of the more extreme combinations.
Consider the data as presented in Table IX.


TABLE IX

FISHER EXACT TABLEAU WITH
SUBMARINE DATA FROM INPUT 7

               NUMBER OF      NUMBER OF
               DETECTIONS     NON-DETECTIONS     TOTAL

REAL WORLD          3               5               8
MODEL              80              40             120

TOTAL              83              45             128


The probability of obtaining this combination of detections and
non-detections is:

\[ \frac{83!\; 45!\; 8!\; 120!}{128!\; 3!\; 5!\; 80!\; 40!} = .07905 \]

For a proof see [6].
The more unlikely combinations, keeping the totals fixed, and their
probabilities are listed in Table X.

TABLE X
MORE EXTREME TABLEAUS IN FISHER EXACT TEST FROM INPUT 7

               DET.   NONDET.      DET.   NONDET.      DET.   NONDET.

REAL WORLD       2       6           1       7           0       8
MODEL           81      39          82      38          83      37

PROBABILITY       .01953             .002653             .0001518

The sum of all these probabilities is .10138, but this represents the
critical region in only one tail of p_0(x), and since the alternate
hypothesis is compound the sum must be doubled. P is therefore .20277.
The results of this test with all 10 inputs are listed in Table XI.
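
The Fisher Exact computation for input 7 can be sketched as follows. The code sums the hypergeometric probabilities of the observed tableau and of the more extreme tableaus with fewer real world detections, then doubles the one-tail sum as described in the text. The function names are illustrative assumptions. Because the sketch uses exact integer arithmetic, its output may differ from the printed figures in the third decimal place, presumably because of rounding in the original hand calculation.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of the 2x2 tableau
           [[a, b],
            [c, d]]
    with all row and column totals held fixed."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


def one_tail(a, b, c, d):
    """Sum over the observed tableau and those with fewer counts in cell a."""
    total = 0.0
    while a >= 0 and d >= 0:
        total += table_probability(a, b, c, d)
        # Move one detection from the real world row to the model row,
        # keeping every marginal total unchanged.
        a, b, c, d = a - 1, b + 1, c + 1, d - 1
    return total


if __name__ == "__main__":
    observed = (3, 5, 80, 40)  # real world: 3 det., 5 non-det.; model: 80, 40
    print(table_probability(*observed))
    print(2 * one_tail(*observed))  # doubled one-tail sum, as in the text
```

Doubling the one-tail sum follows the procedure described here; other two-sided conventions for the Fisher Exact Test exist and can give somewhat different P values.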

These exact probabilities can be very tedious to compute, and again
approximations are available. A normal approximation for use when the
sample size is large is described by Brownlee [7], along with guidelines
on when the approximation is valid. Unfortunately none of the input
results met the criterion, but in three cases they came reasonably close.
The results of using the approximation are listed in Table XI.

Along with the standard attributes of nonparametric statistics, the
Fisher Exact Test and its normal approximation both have well defined
power functions, 1-β, associated with them. The power 1-β for input 1 was
computed using the methods suggested by Brownlee [8] and is listed in
Table XI as .529.


* Working in reverse, it has been shown by Tocher [27] that the Fisher
Exact Test can be used when the conditions of the normal approximation
do not hold.
TABLE XI

SUMMARY OF SUBMARINE MODEL P VALUES
USING THE FREQUENCY OF DETECTION DATA

         FISHER EXACT     NORMAL                        ORIGINAL
INPUT    TEST             APPROXIMATION      POWER      BINOMIAL

  1        .954              .928            .529        > .5
  2        .932                                          > .5
  3        .7772             .668                        > .5
  4        .894                                          > .5
  5        .968                                          > .5
  6       1.000              .984                        > .5
  7        .203                                          > .5
  8        .031                                          < .5
  9        .266                                          > .5
 10        .101                                          < .5

For information concerning the power function and the Fisher Exact Test
see [1,14,19]. Thus for the first time in the tests described, p_1(x)
and p_0(x) can be found.

G.   SUMMARY  OF  TESTS  WITH  SUBMARINE  DATA 

This concludes a far from exhaustive presentation of possible statistical
tests which could be used in validating the submarine model. Hopefully
the types of assumptions that are necessary in nonparametric testing are
reasonably clear. Brownlee, Bradley, and Siegel give a far more in-depth
discussion of nonparametric statistics in their texts referenced in this
section. For a more complete discussion of the power of nonparametric
tests see [9] as well.

Before going on to a parametric test and one where dependence among
samples is considered, examine the difference in the critical regions
obtained by using various tests requiring slightly different assumptions
about the distribution of the test statistic (Tables VI and XI, pp. 32
and 37). While care must be used in explaining the cause of the
differences, it is certainly safe to say that the assumptions of the
Wilcoxon Rank-Sum Test are most closely adhered to, while the ranking
procedure of the original rank sum test and the approximation of p_0(x)
in the Kolmogorov-Smirnov Test would tend to discount the validity of
their results.
VI.   PARAMETRIC AND NONPARAMETRIC TESTS WITH AIRCRAFT DATA BASE

The data used in the statistical tests of this chapter were obtained from
eight independent aircraft-submarine exercises. In each exercise aircraft
monitored a string of eight sonobuoys in an attempt to gain and maintain
detection of a transiting submarine. All exercises were made under
similar conditions, and therefore the conditions can be considered
identical. The measure of effectiveness in these exercises is detection
modulus, or probability of detection. Detection modulus, D.M., is
computed by dividing the total number of minutes detection was held by
the total number of minutes detection could have been held.

\[ \text{D.M.} = \frac{\text{TIME DETECTION WAS HELD}}{\text{TIME DETECTION COULD HAVE BEEN HELD}} \]

For each of the eight runs the range and aspect of initial detection for
each buoy were tabulated. Also recorded were the range and aspect at the
time of losing contact, and the same information in the event that
contact was regained. Because of this extensive data base there are
several random variables which might be tested. All of these fall into
two categories, however: those in which the assumption is made that the
samples are independent and identically distributed, and those which
assume only that the samples are identically distributed. The Paired t,
Wilcoxon, and Kolmogorov-Smirnov Tests fall into the first category while
the Davisson Test falls into the latter.

Thus, using detection modulus as a measure of effectiveness, the
nonparametric Wilcoxon and Kolmogorov-Smirnov Tests and the parametric
Paired t and Davisson Tests will be used to demonstrate the spectrum of
tests and their characteristics that might be used in validating a model
with this type of data base.

A.  PAIRED  t  TEST 

Since we are testing the hypothesis that the real world and the model are
sampling from the same distribution, it is only natural to compare the
differences in their outputs. Let d_i represent the difference between
the real world and model detection moduli for run i, and let

\[ \bar{D} = \frac{1}{n} \sum_{i=1}^{n} d_i , \qquad i = 1, \ldots, n . \]

If S^2 is defined as

\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{D})^2 \]

then it can be shown [13] that

\[ t = \frac{\sqrt{n}\, \bar{D}}{S} \]

is asymptotically t distributed with n-1 degrees of freedom.

Since this test assumes that the d_i are independent, the sample size is
eight, and each sample is the difference between the real world and model
estimates of the true detection modulus computed with all eight buoys
operating in concert. Table XII shows the actual data and part of the
calculations. The corresponding P value is approximately .42.

The Paired t Test has the advantages of the parametric tests, and p_0(x)
is known to be t_(n-1) for large n. Since this distribution is well
tabulated and the arithmetic is basic, the test is easy to perform. The
test does make some very restrictive assumptions. The independence
assumption forces the aggregation of the data to the extent that much
information may be lost. The asymptotic property of the distribution of
the test statistic adds another degree of complexity, for no longer is
p_0(x) known exactly. It is known only in the limit as n increases.

TABLE XII

INDEPENDENT AIRCRAFT DATA FOR PAIRED t TEST

         REAL WORLD      MODEL
RUN         D.M.          D.M.          d_i       |d_i - \bar{D}|

 1         .4865         .3478         .1387         .1748
 2         .3587         .4023        -.0436         .0075
 3         .2500         .4711        -.2211         .1850
 4         .4134         .5034        -.0900         .0539
 5         .1729         .2967        -.1238         .0877
 6         .3884         .4126        -.0242         .0119
 7         .2601         .2517         .0084         .0445
 8         .3171         .2702         .0469         .0830

\[ S^2 = \frac{1}{7} \sum_{i=1}^{8} (d_i - \bar{D})^2 = \frac{.08344}{7} = .012635 \]

\[ t = \frac{\sqrt{n}\, \bar{D}}{S} = \frac{\sqrt{8}\, (-.0361)}{.1124} = -.9084 \]
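
The paired t statistic can be recomputed directly from the two detection modulus columns of Table XII. The sketch below does so from the definitions of the mean difference, S^2, and t given in this section; the variable and function names are illustrative assumptions. A direct recomputation need not agree with the printed intermediate values to every digit, since those were worked by hand.

```python
from math import sqrt

# Detection moduli for runs 1 through 8, as listed in Table XII
REAL_WORLD = [.4865, .3587, .2500, .4134, .1729, .3884, .2601, .3171]
MODEL = [.3478, .4023, .4711, .5034, .2967, .4126, .2517, .2702]


def paired_t(xs, ys):
    """Return (mean difference, S^2, t) for paired samples xs and ys."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    d_bar = sum(d) / n
    s2 = sum((di - d_bar) ** 2 for di in d) / (n - 1)
    t = sqrt(n) * d_bar / sqrt(s2)
    return d_bar, s2, t


if __name__ == "__main__":
    d_bar, s2, t = paired_t(REAL_WORLD, MODEL)
    print(d_bar, s2, t)  # t is referred to a t distribution with 7 d.f.
```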

B.  KOLMOGOROV-SMIRNOV TEST

Another way of comparing the differences in the samples is by the
Kolmogorov-Smirnov Test. The relative merits of the test have been
discussed, but these data present an opportunity to compare the exact
p_0(x) for the Kolmogorov-Smirnov Test to the previously used
approximation of p_0(x). Using the data given in Table XII, the maximum
deviation is .25, and by the previous approximation to p_0(x) the
corresponding P value is .9639, whereas by Massey's exact computation
[18] the P value is .6602. This very large difference indicates the
dangers in using this approximation to p_0(x).
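
The maximum deviation and the approximate P value can be reproduced from the Table XII data with a brief sketch. It assumes that the "previous approximation" is the usual asymptotic Kolmogorov series evaluated at D·sqrt(nm/(n+m)); that reading is consistent with the printed value of .9639 but is an assumption here, as are the function names. Massey's exact computation is not reproduced.

```python
from math import exp, sqrt

REAL_WORLD = [.4865, .3587, .2500, .4134, .1729, .3884, .2601, .3171]
MODEL = [.3478, .4023, .4711, .5034, .2967, .4126, .2517, .2702]


def ks_statistic(xs, ys):
    """Largest absolute difference between the two empirical CDFs."""
    n, m = len(xs), len(ys)
    points = sorted(set(xs) | set(ys))
    return max(
        abs(sum(x <= t for x in xs) / n - sum(y <= t for y in ys) / m)
        for t in points
    )


def kolmogorov_tail(lam, terms=100):
    """Asymptotic tail probability 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2)."""
    return 2 * sum(
        (-1) ** (k - 1) * exp(-2 * k * k * lam * lam) for k in range(1, terms + 1)
    )


if __name__ == "__main__":
    d = ks_statistic(REAL_WORLD, MODEL)
    n, m = len(REAL_WORLD), len(MODEL)
    print(d)
    print(kolmogorov_tail(d * sqrt(n * m / (n + m))))
```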
C.  WILCOXON TEST WITH NORMAL APPROXIMATION

The Wilcoxon Test, when performed on the data of Table XII with use of
the normal approximation to p_0(x), yields a P value of .46. The relative
merits of this test are the same as described previously, and the results
are given only for comparative purposes.
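
A minimal sketch of this calculation follows, assuming the test meant is the two-sample rank-sum form with a two-sided normal approximation and no continuity correction; under that assumption the Table XII data give a P value near the .46 quoted above. The function name is illustrative.

```python
from math import erfc, sqrt

REAL_WORLD = [.4865, .3587, .2500, .4134, .1729, .3884, .2601, .3171]
MODEL = [.3478, .4023, .4711, .5034, .2967, .4126, .2517, .2702]


def rank_sum_normal_p(xs, ys):
    """Return (rank sum of xs, z, two-sided P) by the normal approximation.

    Assumes no tied observations and applies no continuity correction.
    """
    pooled = sorted(xs + ys)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    w = sum(rank[x] for x in xs)
    n, m = len(xs), len(ys)
    mean = n * (n + m + 1) / 2
    var = n * m * (n + m + 1) / 12
    z = (w - mean) / sqrt(var)
    return w, z, erfc(abs(z) / sqrt(2))


if __name__ == "__main__":
    print(rank_sum_normal_p(REAL_WORLD, MODEL))
```

Adding a continuity correction would move the P value somewhat closer to .5.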

D.  DAVISSON  TEST  WITH  DEPENDENCE 

In the past three tests the sample size was eight because independence
among samples was required by each test. What if one wanted to compare
the average detection modulus of each buoy on each run, or perhaps the
average detection modulus in each five mile range band from -50 to 50
miles for each buoy on each run? In these cases, and the many others that
might be considered, the values are dependent on each other and thus none
of the assumptions of the tests mentioned so far are completely
satisfied.

Davisson has shown that, by comparing certain differences between the
real world and model results, the maximum likelihood ratio yields a
statistic with a known distribution [11].

Since the test is very tedious, only a relatively short comparison with
the aircraft data will be given. Consider the detection moduli of buoys
3, 4, and 5 on each run. The random variable to be tested is the average
detection modulus of each buoy. Thus the null hypothesis is that the
average detection moduli for buoys 3, 4, and 5 are the same in the real
world as they are in the model and that their interdependence is also
identical.

The first step in determining the test statistic is to compute the
variance-covariance matrix of the computer's average detection moduli.
To do this the average for each buoy over all runs is subtracted from the
average detection modulus for that buoy on each run. The results are
shown in Tables XIII and XIV.


TABLE XIII
AVERAGE BUOY DETECTION MODULI FOR THE MODEL AND THEIR AVERAGES

                        BUOY
RUN           3            4            5

 1         .240755      .560000      .419032
 2         .365082      .457377      .595555
 3         .104921      .596825      .514203
 4         .584210      .941052      .782857
 5         .051273      .164909      .158269
 6         .496562      .357031      .181154
 7         .038730      .465397      .579434
 8         .209259      .563148      .513929

AVERAGE    .261349      .513217      .468054


TABLE XIV
AVERAGE MODEL DETECTION MODULI MINUS THEIR AVERAGES

                        BUOY
RUN           3            4            5

 1        -.020594      .046783     -.049022
 2         .103733     -.055840      .127501
 3        -.156428      .083608      .046149
 4         .322862      .427835      .314803
 5        -.210076     -.348308     -.309785
 6         .235214     -.156186     -.286900
 7        -.222619     -.047820      .111380
 8        -.052090      .049931      .045875

The transpose of the 8x3 matrix in Table XIV, when multiplied by the
matrix itself and divided by the number of runs (8), yields the
variance-covariance matrix Q. The Q matrix is shown in Table XV.
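
The construction of Q can be sketched from the Table XIV deviations. The sketch divides the cross-product sums by the number of runs, a normalization that appears to match the printed entries of Table XV; the names are illustrative assumptions.

```python
# Deviations from Table XIV: one row per run, columns are buoys 3, 4, 5
DEVIATIONS = [
    (-.020594, .046783, -.049022),
    (.103733, -.055840, .127501),
    (-.156428, .083608, .046149),
    (.322862, .427835, .314803),
    (-.210076, -.348308, -.309785),
    (.235214, -.156186, -.286900),
    (-.222619, -.047820, .111380),
    (-.052090, .049931, .045875),
]


def covariance_matrix(rows):
    """Cross-products of mean-centered columns, divided by the number of rows."""
    n, k = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / n for j in range(k)]
        for i in range(k)
    ]


if __name__ == "__main__":
    for row in covariance_matrix(DEVIATIONS):
        print([round(v, 6) for v in row])
```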
TABLE XV

VARIANCE-COVARIANCE MATRIX Q

    .036453      .020347      .0098831
    .020347      .043229      .034850
    .0098831     .034850      .039085

Now a difference vector M is computed, with each component being equal to
the difference between the average real world detection modulus over all
eight runs and the corresponding result from the model.


TABLE XVI

DIFFERENCE VECTOR M

REAL WORLD      MODEL
AVERAGE         AVERAGE          M

 .449466        .261349        .1881
 .365698        .513217       -.1475
 .539570        .468054        .0715

Davisson has stated [11] that the distribution of

\[ M^{T} Q^{-1} M \]

is asymptotically chi-squared with N degrees of freedom, where N is the
dimension of Q. In this case M^T Q^{-1} M is 9.7682 and the corresponding
P value is .02.
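
The quadratic form and its chi-squared tail probability can be checked with a short sketch using the printed values of Q (Table XV) and M (Table XVI). The linear solve and the closed-form tail for 3 degrees of freedom are written out so that no external library is needed; the function names are illustrative assumptions.

```python
from math import erfc, exp, pi, sqrt

Q = [[.036453, .020347, .0098831],
     [.020347, .043229, .034850],
     [.0098831, .034850, .039085]]
M = [.1881, -.1475, .0715]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def chi2_upper_tail_3df(x):
    """Upper tail of the chi-squared distribution with 3 degrees of freedom."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


def davisson_statistic(m_vec, q_mat):
    """Quadratic form M^T Q^{-1} M, computed without forming the inverse."""
    return sum(mi * zi for mi, zi in zip(m_vec, solve(q_mat, m_vec)))


if __name__ == "__main__":
    stat = davisson_statistic(M, Q)
    print(stat, chi2_upper_tail_3df(stat))
```

Solving Q z = M rather than inverting Q explicitly is one way to limit the loss of accuracy in inversion that the text warns about.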

As was the case with the Paired t Test, this test has the advantages of
being parametric but the disadvantages of its asymptotic properties and
lack of knowledge of p_1(x). The main drawback of the Davisson Test is
its computational difficulty. As the dimension of Q increases a large
computer becomes necessary and the sorting of data becomes quite tedious.
Care must also be taken that accuracy is not lost in the inversion of Q
and that subsets are chosen such that Q is not singular. In spite of all
these disadvantages, the relief from the independence assumption is very
advantageous. If tolerance of its assumptions permits its use, the
Davisson Test will yield a more detailed validation test. It is now
possible to reject part of the model while accepting the rest, thus
allowing trouble-shooting by the simulation analysts. This feature was
also possible with the submarine model, but only because 10 different
inputs were sampled, and thus data collection had to be more extensive
and also more costly.

E.  SUMMARY  OF  TESTS  WITH  AIRCRAFT  DATA 

The  P  values  corresponding  to  each  of  the  four  tests  applied  to  the 
aircraft  data  are  listed  in  Table  XVII. 

TABLE XVII

SUMMARY OF P VALUES USING PARAMETRIC AND
NONPARAMETRIC TESTS ON THE AIRCRAFT MODEL DATA

PAIRED t       KOLMOGOROV-SMIRNOV        WILCOXON     DAVISSON
               EXACT        APPROX.

  .42          .6602        .9639          .46          .02

It is not appropriate to compare the results of the Davisson Test to
those of the other tests due to its unique properties, nor is it feasible
to pass judgement on the remaining tests solely on the results in Table
XVII. It should be noted, however, that the Kolmogorov-Smirnov and
Wilcoxon Test results are based on exact knowledge of p_0(x) while the
Paired t Test and the approximate Kolmogorov-Smirnov Test are not, and
that no additional knowledge of p_1(x) is obtained by using these
approximations. While the distribution of the Davisson Test statistic is
not exact, nor is information about p_1(x) available, it does allow a
more localized validation, thereby allowing "trouble-shooting" which the
other tests do not permit.

As with the submarine data, these tests are far from an exhaustive set of
all those possible. They were chosen to represent the range and spectrum
of assumptions needed to perform the validation of this type of model
with its data base.
VII.  SUMMARY AND CONCLUSIONS

This  paper  has  investigated  the  most  salient  problems  of  present  day 
validation  procedures  and  alleviated  them  by  enlarging  the  scope  of 
validation  and  by  describing  what  is  needed  and  can  be  expected  from 
a  statistical  test  with  "validation  type"  data.  It  was  shown  that 
decision  theory  and  cost  analysis  while  present  in  previous  validations 
received  no  mention,  and  that  statistical  testing  with  its  pass  or  fail 
results  did  not  allow  the  decision  maker  much  flexibility.  While  only 
two  simple  decision  rules  and  one  type  of  decision  criterion  were 
presented,  it  became  obvious  that  by  determining  P  values  from  several 
tests  and  by  trying  to  do  such  things  as  minimizing  expected  cost,  the 
decision  maker  could  avail  himself  of  more  information  and  have  the 
capability  to  change  more  elements  in  his  decision  rule. 

A  general  methodology  for  the  statistical  testing  of  validation 
data  was  also  discussed.  Included  in  the  methodology  are  the  goals  of 
a  "validation  test,"  the  types  of  tests  available  with  their  inherent 
assumptions  and  properties,  the  need  for  multiple  testing,  and  the 
pitfalls  of  relaxing  assumptions  within  a  test. 

It was seen that while a myriad of possible tests exists, those having
exact knowledge of p_0(x) and p_1(x) will be the best. But since p_1(x)
is seldom known due to the nature of the alternate hypothesis, and
calculation procedures necessitate approximations to p_0(x) in many
cases, these desirable tests are not always available. Some tests are
clearly better than others, but in general it was seen that several tests
using different assumptions should be used to achieve the most reliable
information about P and β.



In  conclusion  the  problems  of  validation  are  analogous  to  those  of 
systems  analysis  and  cost-effectiveness.  The  goal  or  criterion  can  be 
defined  as  minimization  of  expected  cost  for  a  fixed  level  of  validity, 
yet  the  methods  of  exact  determination  are  not  as  well  defined  and  need 
to  be  considered  in  concert  instead  of  individually.  In  the  past,  one 
of  the  methods  was  statistical  testing.  When  used  alone  there  existed 
reasons  to  criticize  the  validations  but  when  used  in  the  procedure  as 
presented  in  this  paper,  the  validator  has  more  flexibility  and  is  able 
to  use  more  information  from  his  data  and  other  sources. 

Another  important  advantage  of  this  procedure  is  the  increased 
ability  to  see  the  effects  of  changes  in  a  decision  rule.  All  that 
could  be  seen  previously  was  that  at  a  significance  level  of  .6  the 
model  was  considered  invalid  but  at  a  .4  level  it  was  not.  Now  such 
things  as  the  changes  in  a  decision  rule  caused  by  refusing  to  accept 
a  priori  knowledge  of  the  states  of  nature  can  be  observed. 

So  just  as  was  done  with  systems  analysis  a  new  approach  or  way  of 
looking  at  a  problem  has  been  proposed.  This  time  it  is  to  help  the 
decision  maker  with  his  important  and  complex  problems  of  model 
validation. 
VIII.  AREAS FOR FUTURE STUDY

Since  this  paper  represents  a  pilot  study  in  the  expansion  of  model 
validation,  almost  any  facet  of  the  paper  could  and  should  be  expanded. 

The  area  of  simulation  theory  is  normally  not  considered  an  O.R. 
problem  at  least  in  the  context  of  calibrating  the  model.  The  search 
for  more  nearly  perfect  statistical  tests  is  also  considered  as  second 
in  importance  to  the  development  of  decision  rules  applicable  to  model 
validation. 

After  several  decision  rules  have  been  presented  then  case  studies 
similar  to  those  of  systems  analysis  will  make  a  valuable  contribution 
to  the  field  of  validation. 